A.  Introduction

The availability of instruction-level simulators for the CRAY X-MP and the
CRAY-2, together with early access to the MFECC and NAS CRAY-2’s, has made
possible the study of a variety of equation-solving issues for many-processor VMP
configurations. These include:

(a) the development of equation-solving algorithms on the CRAY-2;

(b)  task  granularity  studies;  and 

(c)  memory  conflict  studies. 


B.  Equation-solving  on  the  CRAY-2 

The CRAY-2 combines the algorithmic requirements of (a) vector inner-loop
syntax and high-level taskability, similar to the X-MP, and (b) partitioning to exploit a
local memory (LM) that is necessitated by a massive but slow common memory (CM).
Although this appears to be the most demanding set of requirements of any
supercomputer architecture, all of them can in fact be satisfied in equation solution by
block partitioning of the matrix.
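As an illustration of how block partitioning can serve these requirements together, the following is a minimal Python sketch of a tiled matrix-matrix update. It is not the CRAY-2 implementation; the function name `block_mm_update` and the tile size `b` are assumptions made for this example. Each triple of tiles stands in for the data that would be staged into local memory, and the innermost loop runs along a row, as a vector inner loop would.

```python
def block_mm_update(C, A, B, b):
    """Compute C <- C - A*B tile by tile, modifying C in place.

    A is m-by-p, B is p-by-n, C is m-by-n (lists of lists).
    b is a hypothetical tile size: each (i0, j0, k0) step touches one
    tile each of C, A and B, which is the working set that a small
    fast memory would have to hold.
    """
    m, p, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, b):
        for j0 in range(0, n, b):
            for k0 in range(0, p, b):
                for i in range(i0, min(i0 + b, m)):
                    Ci, Ai = C[i], A[i]
                    for k in range(k0, min(k0 + b, p)):
                        a, Bk = Ai[k], B[k]
                        # Innermost loop sweeps along a row segment.
                        for j in range(j0, min(j0 + b, n)):
                            Ci[j] -= a * Bk[j]
    return C
```

Each tile of A and B is reused across a whole tile of C, so the number of arithmetic operations per word fetched from the slow memory grows with `b`. Independent tiles of C could also be handed to separate tasks.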

In [3] and [4], the performances of several uniprocessor CRAY-2 implementations
were compared. The above block partitioning based on a matrix-matrix multiply (M*M)
kernel was shown to produce a speedup greater than 6:1 over conventionally-coded
Gauss elimination, and nearly 2:1 over the best previous algorithms [Dongarra] based
on matrix-vector multiplies. To date, this is the most dramatic evidence on a
supercomputer architecture of the value of basing linear algebra algorithms on an M*M
kernel (also termed a third-level BLAS). A long-standing problem with block partitioning
when pivoting is required was solved by using a two-level algorithm, which preserved
the asymptotic performance of the M*M kernel at the higher level and permitted pivot
searches and row exchanges at the lower level.
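One common way to organize such a two-level scheme is sketched below in Python. This is an assumed structure for illustration, not the algorithm of [3] or [4]; the name `blocked_lu` and the panel width `nb` are hypothetical. The lower level eliminates a narrow panel of columns with pivot searches and row exchanges, and the higher level then updates the rest of the matrix with a matrix-matrix product.

```python
def blocked_lu(A, nb):
    """In-place LU factorization with partial pivoting (sketch).

    A is a square list-of-lists matrix; nb is the panel width.
    On return A holds L (unit lower, below the diagonal) and U.
    Returns perm, where row i of the factored matrix is original row perm[i].
    """
    n = len(A)
    perm = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Lower level: unblocked elimination inside the panel,
        # with pivot search and whole-row exchanges.
        for k in range(k0, k1):
            p = max(range(k, n), key=lambda r: abs(A[r][k]))
            if A[p][k] == 0.0:
                raise ValueError("matrix is singular")
            if p != k:
                A[k], A[p] = A[p], A[k]
                perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]
                m = A[i][k]
                for j in range(k + 1, k1):
                    A[i][j] -= m * A[k][j]
        # Higher level, step 1: complete the block row of U
        # (unit-lower triangular solve against the panel's L).
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                m = A[i][k]
                for j in range(k1, n):
                    A[i][j] -= m * A[k][j]
        # Higher level, step 2: trailing update A22 <- A22 - L21*U12.
        # This is the matrix-matrix multiply that dominates the work.
        for i in range(k1, n):
            for k in range(k0, k1):
                m = A[i][k]
                for j in range(k1, n):
                    A[i][j] -= m * A[k][j]
    return perm
```

With `nb = 1` the sketch reduces to ordinary Gauss elimination with partial pivoting. For larger `nb`, most of the arithmetic moves into the trailing update, while the pivot logic stays confined to the panel.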

The  same  block  partitioning  will  permit  parallel  solution  when  multitasking 
becomes  available  on  the  CRAY-2  in  May-June  1986. 

C.  Task  Granularity  Studies 

A simulation study of up to a 16-processor X-MP was completed [1]. The effect of
choosing different task sizes in solving a full set of equations was studied. This was the
largest number of processors studied until the recent availability of the
(slow, non-vector) message-passing machines.
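The trade-off behind the choice of task size can be illustrated with a toy cost model. The Python sketch below is not the simulator used in [1], and it reproduces none of that study's results. The function `modeled_speedup`, the cost of `m` units per column update, and the fixed per-task `overhead` are all assumptions made for this example.

```python
import heapq

def modeled_speedup(n, p, g, overhead):
    """Toy speedup model for eliminating an n-by-n system on p processors.

    At each elimination step the m remaining columns are grouped into
    tasks of g columns. Updating one column is assumed to cost m work
    units, and every task pays a fixed scheduling overhead. Tasks go to
    the least-loaded processor; the step ends when the slowest finishes.
    """
    serial = 0.0
    parallel = 0.0
    for m in range(n - 1, 0, -1):
        serial += m * m
        loads = [0.0] * p
        heapq.heapify(loads)
        for c0 in range(0, m, g):
            cols = min(g, m - c0)
            t = heapq.heappop(loads)
            heapq.heappush(loads, t + cols * m + overhead)
        parallel += max(loads)
    return serial / parallel

# Example sweep over task sizes for a hypothetical 16-processor machine.
if __name__ == "__main__":
    for g in (1, 2, 4, 8, 16, 32):
        print(g, round(modeled_speedup(256, 16, g, overhead=50.0), 2))
```

In this model small tasks pay the overhead many times, while large tasks leave processors idle near the end of each step, so an intermediate task size tends to do best.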


D.  Memory  Conflict  Studies 

The memory conflict problem is the result of attempting to service a
multiprocessor architecture with a critical uniprocessor resource - main or common
memory. To the extent that conflicts can be avoided, a major rationale for having local
memory - and the algorithmic complexity it introduces - is removed. A modest
conflict problem was observed for the X-MP, in the range of 5-10% performance
degradation; with the CRAY-2, this degradation approaches 60%.
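The mechanism can be sketched with a small Python simulation of several processors streaming vectors from one interleaved memory. It is a deliberately simplified model with assumed parameters (bank count, bank busy time, fixed processor priority). It does not model the actual X-MP or CRAY-2 memory systems, and the name `simulate_shared_banks` is hypothetical.

```python
def simulate_shared_banks(starts, n_words, n_banks, busy, stride=1):
    """Cycles needed for every processor to issue n_words references.

    starts[p] is the first word address used by processor p; word
    address w lives in bank w % n_banks. A bank that accepts a
    reference stays busy for `busy` cycles. Each cycle, processors are
    served in index order; one whose bank is busy stalls and retries.
    """
    n_procs = len(starts)
    free_at = [0] * n_banks      # first cycle at which each bank is free
    issued = [0] * n_procs       # references completed per processor
    cycle = 0
    while min(issued) < n_words:
        for p in range(n_procs):
            if issued[p] == n_words:
                continue
            bank = (starts[p] + issued[p] * stride) % n_banks
            if free_at[bank] <= cycle:
                free_at[bank] = cycle + busy
                issued[p] += 1
        cycle += 1
    return cycle

# Example: compare one stream with two streams that start in the same bank.
if __name__ == "__main__":
    alone = simulate_shared_banks([0], 64, n_banks=16, busy=4)
    shared = simulate_shared_banks([0, 0], 64, n_banks=16, busy=4)
    print(alone, shared)
```

The ratio of the cycle count to `n_words` is the degradation in this model. It depends on the relative starting banks and strides of the streams, which are properties of the running program rather than of any one processor.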

The common memory conflict problem is largely unappreciated in the academic
community; thus, it is possible to make a contribution without an extensive architectural
design background. It had previously been shown that a critical memory organization
feature of the CRAY X-MP-2 permitted worst-case steady-state delays of 25% in vector
accesses of CM. Based on this and similar internal studies, Cray Research
re-designed the memory system of the X-MP-4. Unfortunately, the new design
aggravated the startup access delay; this feature was documented by this author in
[2], which has led to a new design for future X-MP memories.


II.  FUTURE  WORK 

Equation-solving  on  a  message-passing  MP,  including  the  new  INTEL  vector 
hypercube,  will  be  studied  to  develop  generic  linear  algebra  algorithms  applicable  to 
MP's  ranging  from  slow  massively-parallel  to  fast  many-processor  vector  architectures. 


III.  COUPLING  ACTIVITIES  (with  national  laboratories) 


A.  Air  Force  Flight  Dynamics  Laboratory  (FY  85-86) 

Visiting  Scientist,  studying  vectorization  of  structural  transient  optimization 
codes. 

B.  Los  Alamos  National  Laboratory  (FY  85-86) 

Consultant,  studying  memory  conflict  problems  on  a  many-processor  X-MP. 

C.  Lawrence  Livermore  National  Laboratory  (Summer,  Fall,  1985) 

Consultant,  studying  memory  conflict  problems  on  the  CRAY-2. 